Fix error after remote tidy timeout #4465

MetRonnie · 2021-10-12T16:49:24Z

This is a small change with no associated Issue.

Fix a ValueError that could occur in remote tidy during shutdown:

2021-10-11T11:38:10Z ERROR - Error during shutdown
2021-10-11T11:38:10Z ERROR - too many values to unpack (expected 2)
	Traceback (most recent call last):
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/scheduler.py", line 1613, in shutdown
	    await self._shutdown(reason)
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/scheduler.py", line 1708, in _shutdown
	    self.task_job_mgr.task_remote_mgr.remote_tidy()
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/task_remote_mgr.py", line 347, in remote_tidy
	    for platform_n, (cmd, proc) in procs.items():
	ValueError: too many values to unpack (expected 2)

I think that might have been why _remote_background_indep_poll was so flaky on GH Actions.

In the process I refactored TaskRemoteMgr.remote_tidy() to use a queue of processes instead of a dict, much like how remote clean works.
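A minimal sketch of the queue-based approach described above, not the actual cylc-flow code. Only `queue.popleft()`, `item.proc.poll()` and the `while queue and time() < timeout` shape appear in this PR; the names `TidyItem`, `platform_name`, `drain` and `run_demo`, the stand-in commands, and the 0.1 s sleep interval are assumptions for illustration.

```python
import subprocess
import sys
from collections import deque
from time import sleep, time
from typing import Deque, List, NamedTuple


class TidyItem(NamedTuple):
    """Hypothetical queue entry; only the `proc` attribute appears in the PR."""
    platform_name: str
    proc: subprocess.Popen


def drain(queue: Deque[TidyItem], timeout_secs: float) -> List[str]:
    """Poll queued processes until all finish or the deadline passes.

    Returns the names of the platforms whose command exited in time;
    anything still running at the deadline is terminated.
    """
    finished: List[str] = []
    deadline = time() + timeout_secs
    while queue and time() < deadline:
        item = queue.popleft()
        if item.proc.poll() is None:  # still running: put it back
            queue.append(item)
            sleep(0.1)  # short sleep so the loop does not spin the CPU
            continue
        finished.append(item.platform_name)
    # Terminate anything left over after the deadline
    for item in queue:
        item.proc.terminate()
        item.proc.wait()
    return finished


def run_demo() -> List[str]:
    # Stand-in commands: a quick one and one that outlives the timeout
    quick = [sys.executable, '-c', 'pass']
    slow = [sys.executable, '-c', 'import time; time.sleep(30)']
    queue = deque([
        TidyItem('fast-platform', subprocess.Popen(quick)),
        TidyItem('slow-platform', subprocess.Popen(slow)),
    ])
    return drain(queue, timeout_secs=5)
```

Because each entry is accessed by attribute (`item.proc`) rather than by positional unpacking, adding a field to the entry later would not break the termination loop.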

Requirements check-list

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.py and conda-environment.yml.
Does not need tests.
Appropriate change log entry included.
No documentation update required.

MetRonnie · 2021-10-12T16:52:34Z

cylc/flow/task_remote_mgr.py

        # Terminate any remaining commands
-        for platform_n, (cmd, proc) in procs.items():


This was the offending line
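For illustration, the error in the traceback can be reproduced with a loop of the same shape. The real contents of `procs` are not shown in this PR, so the dict values below are hypothetical; the point is only that a value with more than two elements cannot be unpacked into `(cmd, proc)`.

```python
def unpack_pairs(procs: dict) -> list:
    """Iterate the way the old loop did: each value must be a 2-tuple."""
    return [
        (platform_n, cmd)
        for platform_n, (cmd, proc) in procs.items()
    ]


# Hypothetical data: the real contents of `procs` are not shown in the PR.
two_items = {'platform_a': ('tidy-cmd', 'proc-a')}
three_items = {'platform_a': ('tidy-cmd', 'proc-a', 'extra')}

pairs = unpack_pairs(two_items)  # fine: values have exactly two elements

try:
    unpack_pairs(three_items)
except ValueError as exc:
    # Same kind of error as in the shutdown traceback
    error_message = str(exc)
    print(error_message)
```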

hjoliver · 2021-10-12T21:59:47Z

> I think that might have been why _remote_background_indep_poll was so flaky on GH Actions.

Nice, kudos if that works out 🥇

oliver-sanders · 2021-10-13T09:27:45Z

cylc/flow/task_remote_mgr.py

-                    self.bad_hosts.add(host)
+        while queue and time() < timeout:
+            item = queue.popleft()
+            if item.proc.poll() is None:  # proc still running


Could use proc.wait(timeout=10) to avoid this polling loop and the subsequent termination loop, e.g.:

    for proc in procs:
        try:
            if proc.wait(10):
                ...
            else:
                ...
        except subprocess.TimeoutExpired:
            ...

MetRonnie · 2021-10-13

I think that would negate the concurrency in the way the queue is currently handled?

oliver-sanders · 2021-10-13

The processes would still be concurrent; however, the result retrieval would be sequential:

    from subprocess import *
    from time import time

    procs = [
        Popen(['sleep', '2'])
        for _ in range(5)
    ]
    start = time()
    for proc in procs:
        proc.wait()
    print(f'{time() - start}')

    $ python test.py
    2.007575035095215

MetRonnie · 2021-10-13

I see, but considering that retries for SSH 255 failures are queued after result retrieval, I'd have thought it would still slow things down in those cases.

oliver-sanders · 2021-10-13

Ah, damn, adding to the list whilst processing it makes a mess of things.
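For comparison, here is a runnable sketch of the `proc.wait(timeout=...)` alternative raised in this thread, under the assumption that no retries need to be queued while waiting. It uses one shared deadline, so the total wait stays bounded however many processes there are. The names `wait_all` and `run_wait_demo` and the stand-in commands are hypothetical and not part of cylc-flow.

```python
import subprocess
import sys
from time import monotonic
from typing import List, Optional


def wait_all(
    procs: List[subprocess.Popen], timeout_secs: float
) -> List[Optional[int]]:
    """Wait for each process in turn against one shared deadline.

    Returns each return code, or None for a process that had to be
    terminated because the deadline passed.
    """
    deadline = monotonic() + timeout_secs
    results: List[Optional[int]] = []
    for proc in procs:
        remaining = max(0.0, deadline - monotonic())
        try:
            results.append(proc.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            proc.terminate()
            proc.wait()  # reap the terminated process
            results.append(None)
    return results


def run_wait_demo() -> List[Optional[int]]:
    # Stand-in commands: clean exit, non-zero exit, and one that hangs
    commands = [
        'pass',
        'import sys; sys.exit(3)',
        'import time; time.sleep(30)',
    ]
    procs = [
        subprocess.Popen([sys.executable, '-c', cmd]) for cmd in commands
    ]
    return wait_all(procs, timeout_secs=5)
```

The processes run concurrently and only the collection of results is sequential. As the thread concludes, this shape becomes awkward once failed commands have to be retried, because new processes would need to be added to the list while it is being iterated.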

MetRonnie · 2021-10-13T12:02:59Z

The number of failing test files (before retry) is now much lower: 7, down from 21.

Before:

Test Summary Report
-------------------
tests/k/cylc-poll/16-execution-time-limit.t                                    (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
tests/f/cylc-clean/01-remote.t                                                 (Wstat: 0 Tests: 10 Failed: 1)
  Failed test:  2
tests/f/cylc-ping/04-check-keys-remote.t                                       (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  2, 5
tests/f/platforms/02-host-to-platform-upgrade.t                                (Wstat: 0 Tests: 6 Failed: 1)
  Failed test:  5
tests/f/job-submission/09-activity-log-host-bad-submit.t                       (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/job-submission/02-job-nn-remote-host.t                                 (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/intelligent-host-selection/03-polling.t                                (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/events/10-task-event-job-logs-retrieve.t                               (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/job-submission/17-remote-localtime.t                                   (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/events/17-task-event-job-logs-retrieve-command.t                       (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  2
tests/f/database/03-remote.t                                                   (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  2
tests/f/events/33-task-event-job-logs-retrieve-3.t                             (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3-4
tests/f/job-submission/13-tidy-submits-of-prev-run-remote-host.t               (Wstat: 0 Tests: 11 Failed: 2)
  Failed tests:  2, 7
tests/f/cylc-cat-log/11-remote-retrieve.t                                      (Wstat: 0 Tests: 7 Failed: 1)
  Failed test:  2
tests/f/events/11-cycle-task-event-job-logs-retrieve.t                         (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3
tests/f/remote/06-poll.t                                                       (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/remote/00-basic.t                                                      (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/restart/26-remote-kill.t                                               (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  2
tests/f/cylc-cat-log/10-remote-no-retrieve.t                                   (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  2
tests/f/authentication/01-remote-workflow-same-name.t                          (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  3
tests/f/cylc-cat-log/08-editor-remote.t                                        (Wstat: 0 Tests: 16 Failed: 1)
  Failed test:  2

After:

Test Summary Report
-------------------
tests/f/cylc-ping/04-check-keys-remote.t                                       (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  5
tests/f/events/33-task-event-job-logs-retrieve-3.t                             (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3-4
tests/f/events/17-task-event-job-logs-retrieve-command.t                       (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3
tests/f/remote/06-poll.t                                                       (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
tests/k/cylc-poll/16-execution-time-limit.t                                    (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  3-4
tests/f/events/10-task-event-job-logs-retrieve.t                               (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
tests/f/events/11-cycle-task-event-job-logs-retrieve.t                         (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3

datamel

I have checked out the branch and read the code. The only issue I have found is that the purge has been commented out in the intelligent host selection (IHS) test.

tests/functional/intelligent-host-selection/01-periodic-clear-badhosts.t


hjoliver

👍 thanks @MetRonnie

MetRonnie added the "bug?" (Not sure if this is a bug or not) label Oct 12, 2021

MetRonnie added this to the cylc-8.0rc1 milestone Oct 12, 2021

MetRonnie requested review from oliver-sanders and datamel October 12, 2021 16:49

MetRonnie self-assigned this Oct 12, 2021

MetRonnie commented Oct 12, 2021


MetRonnie modified the milestones: cylc-8.0rc1, cylc-8.0b3 Oct 12, 2021

MetRonnie added 3 commits October 12, 2021 17:57

Fix ValueError during shutdown; refactor remote_tidy()

4dc5209

Fix occasional lack of install target set in platform during remote tidy

a64c85b

Update changelog

b40a558

MetRonnie force-pushed the remote-tidy-fix branch from 4818895 to b40a558 October 12, 2021 16:57

oliver-sanders reviewed Oct 13, 2021


cylc/flow/task_remote_mgr.py

oliver-sanders reviewed Oct 13, 2021


datamel reviewed Oct 13, 2021


cylc/flow/task_remote_mgr.py (outdated)

Fix spurious warning on remote tidy

a1c662d

datamel reviewed Oct 13, 2021


tests/functional/intelligent-host-selection/01-periodic-clear-badhosts.t (outdated)

MetRonnie added 2 commits October 13, 2021 13:37

Improve logging for remote tidy

474eb21

Add short sleep to remote tidy

19d7e59

Avoid maxing out CPU!

MetRonnie force-pushed the remote-tidy-fix branch from 3a7321b to 19d7e59 October 13, 2021 12:38

oliver-sanders added the "bug" (Something is wrong :() label and removed the "bug?" (Not sure if this is a bug or not) label Oct 13, 2021

oliver-sanders approved these changes Oct 14, 2021


hjoliver approved these changes Oct 15, 2021


hjoliver merged commit 21ff967 into cylc:master Oct 15, 2021

MetRonnie deleted the remote-tidy-fix branch October 15, 2021 15:26

MetRonnie changed the title from "Attempt to address remote test flakiness" to "Fix error after remote tidy timeout" Oct 21, 2021

MetRonnie mentioned this pull request Nov 1, 2021

speed up _remote_background_indep_poll tests #4433

Closed

MetRonnie linked an issue Nov 1, 2021 that may be closed by this pull request

speed up _remote_background_indep_poll tests #4433

Closed

hjoliver mentioned this pull request Dec 7, 2021

2021 Cylc Meetings cylc/cylc-admin#139

Closed
